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Abstract 

Compressed  Sensing  seeks  to  capture  a  discrete  signal  x  G  MN  with  a  small 
number  n  of  linear  measurements.  The  information  captured  about  x  from  such 
measurements  is  given  by  the  vector  y  =  4>x  G  Mn  where  $  is  an  n  x  N  matrix. 

The  best  matrices,  from  the  viewpoint  of  capturing  sparse  or  compressible  signals, 
are  generated  by  random  processes,  e.g.  their  entries  are  given  by  i.i.d.  Bernoulli 
or  Gaussian  random  variables.  The  information  y  holds  about  x  is  extracted  by  a 
decoder  A  mapping  Mn  into  1RN .  Typical  decoders  are  based  on  i \  -minimization 
and  greedy  pursuit.  The  present  paper  studies  the  performance  of  decoders  based 
on  thresholding.  For  quite  general  random  families  of  matrices  <b,  decoders  A 
are  constructed  which  are  instance-optimal  in  probability  by  which  we  mean  the 
following.  If  x  is  any  vector  in  MN ,  then  with  high  probability  applying  A  to  y  =  4>x 
gives  a  vector  x  :=  A (y)  such  that  ||x— x||  <  Co<Jk{x)e2  for  all  k  <  an/  log  N  provided 
a  is  sufficiently  small  (depending  on  the  probability  of  failure).  Here  ak{x)i2  is  the 
error  that  results  when  x  is  approximated  by  the  k  sparse  vector  which  equals  x  in 
its  k  largest  coordinates  and  is  otherwise  zero.  It  is  also  shown  that  results  of  this 
type  continue  to  hold  even  if  the  measurement  vector  y  is  corrupted  by  additive 
noise:  y  =  4>x  +  e  where  e  is  some  noise  vector.  In  this  case  cq,(x)^2  is  replaced  by 
<rk(x)t2  +  ||e|W 
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1  Introduction 

1.1  Background 

The  typical  paradigm  for  acquiring  a  compressed  representation  of  a  discrete  signal  x  G 
1Rn ,  N  large,  is  to  choose  an  appropriate  basis,  compute  all  of  the  coefficients  of  x  in  this 
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basis,  and  then  retain  only  the  k  largest  of  these  with  k  <  N.  Without  loss  of  generality, 
we  can  assume  that  the  appropriate  basis  is  the  canonical  Kroenecker  delta  basis.  If 
Sk  C  {1,  •  •  • ,  N}  denotes  a  set  of  indices  corresponding  to  k  largest  entries  in  x,  then  xsk 
is  the  compressed  approximation  to  x.  Here  and  throughout  this  paper,  for  a  set  T  of 
indices,  we  denote  by  xt  the  vector  which  is  identical  to  x  on  T  but  is  zero  outside  T . 

For  any  £p  norm,  this  approximation  process  is  equivalent  to  best  fc-term  approxima¬ 
tion.  Namely,  if 

:=  {z  E  Mn  :  #(supp (z))  <  k},  (1.1) 

where  supp(z)  is  the  number  of  nonzero  entries  in  z,  and  if  for  any  norm  ||  •  ||v  on  1RN , 
we  define 

crk(x)x  ■=  inf  \\x-z\\x,  (1-2) 

z€.Yjk 

then  ||a;  —  xsk\\ep  =  =  <Xk(x)ep-  That  is,  xsk  is  a  best  approximation  to  x  from  Efc. 

This  approximation  process  should  be  considered  as  adaptive  since  the  indices  of  those 
coefficients  which  are  retained  vary  from  one  signal  to  another. 

Since,  in  the  end,  we  retain  only  k  entries  of  x  in  the  above  compression  paradigm, 
it  seems  wasteful  to  initially  make  N  measurements.  The  theory  of  compressed  sensing 
as  formulated  by  Candes,  Romberg  and  Tao  [8,  9]  and  by  Donoho  [14],  asks  whether  it 
is  possible  to  actually  make  a  number  n  of  non- adaptive  linear  measurements,  with  n 
comparable  to  k,  and  still  retain  the  necessary  information  about  x  in  order  to  build  a 
good  compressed  approximation.  These  measurements  are  represented  by  a  vector 

V  =  ®x,  (1.3) 

of  dimension  n  <  N  where  <f>  is  an  n  x  N  measurement  matrix  (called  a  CS  matrix). 
To  extract  the  information  that  the  measurement  vector  y  holds  about  x,  one  uses  a 
decoder  A  which  is  a  mapping  from  ]Rn  into  ]RN .  The  vector  x*  :=  A (y)  =  A(<ha;)  is  our 
approximation  to  x  extracted  from  the  information  y.  In  contrast  to  <f>,  the  operator  A 
is  allowed  to  be  non-linear. 

In  recent  years,  considerable  progress  has  been  made  in  understanding  the  perfor¬ 
mance  of  various  choices  of  the  measurement  matrices  <f>  and  decoders  A.  Although  not 
exclusively,  by  far  most  contributions  focus  on  the  ability  of  such  an  encoder-decoder  pair 
($,  A)  to  recover  a  sparse  signal.  For  example,  a  typical  theorem  says  that  there  are  pairs 
(<£>,  A)  such  that  whenever  x  E  S &,  with  k  <  an/  log (N/k),  then  x*  =  x. 

From  both  a  theoretical  and  a  practical  perspective,  it  is  highly  desirable  to  have  pairs 
(<E>,  A)  that  are  robust  in  the  sense  that  they  are  effective  even  when  the  vector  x  is  not 
assumed  to  be  sparse.  The  question  arises  as  to  how  we  should  measure  the  effectiveness 
of  such  an  encoder-decoder  pair  (<f>,  A)  for  non-sparse  vectors.  In  [6]  we  have  proposed  to 
measure  such  performance  in  a  metric  ||  •  \\x  by  the  largest  value  of  k  for  which 

\\x ~  A($x)||x  <  C0ak(x)x,  VxelRN,  (1.4) 

with  C0  a  constant  independent  of  k,  n,  N.  We  say  that  a  pair  (<E>,  A)  which  satisfies 
property  (1.4)  is  instance- optimal  of  order  k  with  constant  Co-  It  was  shown  that  this 
measure  of  performance  heavily  depends  on  the  norm  employed  to  measure  error.  Let  us 
illustrate  this  by  two  contrasting  results  from  [6]: 
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(i)  If  ||  •  || x  is  the  I? r norm,  it  is  possible  to  build  encoding-decoding  pairs  (<h,A) 
which  are  instance-optimal  of  order  k  with  a  suitable  constant  C'0  whenever  n  > 
ck  log(N/k)  provided  c  and  Co  are  sufficiently  large.  Moreover  the  decoder  A  can 
be  taken  as 

A(r/)  :=  argmin  H^l^.  (1.5) 

&z=y 

Therefore,  in  order  to  obtain  the  accuracy  of  h-terrn  approximation,  the  number 
n  of  non-adaptive  measurements  need  only  exceed  the  amount  k  of  adaptive  mea¬ 
surements  by  the  small  factor  c\og(N/k).  We  shall  speak  of  the  range  of  k  which 
satisfy  k  <  an/\og(N/k)  as  the  large  range  since  it  is  the  largest  range  of  k  for 
which  instance-optimality  can  hold. 

(ii)  In  the  case  ||  •  ||x  is  the  £2-norm,  if  (<h,  A)  is  any  encoding-decoding  pair  which 
is  instance-optimal  of  order  k  =  1  with  a  fixed  constant  Co,  then  the  number  of 
measurement  n  is  always  larger  than  aN  where  a  >  0  depends  only  on  Co-  Therefore, 
the  number  of  non-adaptive  measurements  has  to  be  very  large  in  order  to  compete 
with  even  one  single  adaptive  measurement. 

The  matrices  <f>  which  have  the  largest  range  of  instance-optimality  for  A  are  all  given 
by  stochastic  constructions.  Namely,  one  creates  an  appropriate  random  family  <h(cu)  of 
n  x  N  matrices  on  a  probability  space  (fl,  p)  and  then  shows  that  with  high  probability 
on  the  draw,  the  resulting  matrix  $  =  <3>(u;)  will  satisfy  instance-optimality  for  the  large 
range  of  k.  There  are  no  known  deterministic  constructions.  The  situation  is  even  worse 
in  the  sense  that  given  an  n  x  N  matrix  <h  there  is  no  simple  method  for  checking  its 
range  of  instance-optimality. 

While  the  above  results  show  that  instance-optimality  is  not  a  viable  concept  in  £2,  it 
turns  out  that  the  situation  is  not  as  bleak  as  it  seems.  For  example,  a  more  optimistic 
result  was  established  by  Candes,  Romberg  and  Tao  in  [9].  They  show  that  if  n  > 
ck  log(N/k)  it  is  possible  to  build  pairs  (<h,  A)  such  that  for  all  x  E  IRN, 

||x-A(fe)lk<C„AA,  (1.6) 

with  the  decoder  again  defined  by  (1.5).  This  implies  in  particular  that  /c-sparse  signals 
are  exactly  reconstructed  and  that  signals  x  in  the  space  weak  ip  (denoted  by  w£p )  with 
|  |x  1 1  wip  A  M  for  some  p  <  1  are  reconstructed  with  accuracy  C0Mk~s  with  s  =  1/p  —  1/2. 
This  bound  is  of  the  same  order  as  the  best  estimate  available  on  max  (oy.(a;)^2  :  ||x||^p  < 
M}.  Of  course,  this  result  still  falls  short  of  instance-optimality  in  £2  as  it  must. 

The  starting  point  of  the  present  paper  is  the  intriguing  fact,  that  instance-optimality 
can  be  attained  in  0  if  one  accepts  a  probabilistic  statement.  A  first  result  in  this 
direction,  obtained  by  Cormode  and  Mutukrishnan  in  [7] ,  shows  how  to  construct  random 
n  x  N  matrices  <f>(c<;)  and  a  decoder  A  =  A  (a;),  w  e  h,  such  that  for  any  x  E  1RN , 

Ik  -  A(<Fr)|k  <  C0ak(x)e2  (1.7) 

holds  with  overwhelming  probability  (larger  than  1  —  e(n)  where  e(n)  tends  rapidly  to  0 
as  n  — >  Too)  as  long  as  k  <  an/ (log  N)5^2  with  a  suitably  small.  Note  that  this  result 
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says  that  given  x,  the  set  of  u  E  fl  for  which  (1.7)  fails  to  hold  has  small  measure.  This 
set  of  failure  will  depend  on  x. 

From  our  viewpoint,  instance- optimality  in  probability  is  the  proper  formulation  in  t2. 
Indeed,  even  in  the  more  favorable  setting  of  G,  we  can  never  put  our  hands  on  matrices 
<f>  which  have  the  large  range  of  instance-optimality.  We  only  know  with  high  probability 
on  the  draw,  in  certain  random  constructions,  that  we  can  attain  instance-optimality.  So 
the  situation  in  i2  is  not  that  much  different  from  that  in  G- 

The  results  in  [6]  pertaining  to  instance-optimality  in  probability  asked  two  funda¬ 
mental  questions:  (i)  can  we  attain  instance-optimality  for  the  largest  range  of  k,  i.e. 
k  <  an/\og(N/k),  and  (ii)  what  are  the  properties  of  random  families  that  are  needed 
to  attain  this  performance.  We  showed  that  instance-optimality  can  be  obtained  in  the 
probabilistic  setting  for  the  largest  range  of  k,  i.e.  k  <  an/\og(N/k)  using  quite  general 
constructions  of  random  matrices.  Namely,  we  introduced  two  properties  for  a  random 
matrix  <f>  which  ensure  instance-optimality  in  the  above  sense  and  then  showed  that  these 
two  properties  hold  for  rather  general  constructions  of  random  matrices  (such  as  Gaussian 
and  Bernoulli).  However,  one  shortcoming  of  the  results  in  [6]  is  that  the  decoder  used  in 
establishing  instance-optimality  was  defined  by  minimizing  \\y  —  <ha;| \i2  over  all  /c-sparse 
vectors,  a  task  which  cannot  be  achieved  in  any  reasonable  computational  time. 

1.2  Objectives 

In  the  present  paper,  we  shall  be  interested  in  which  practical  decoders  can  be  used  with 
a  general  random  family  so  as  to  give  a  sensing  system  which  has  instance-optimality  in 
probability  for  l2  for  the  largest  range  of  k.  The  first  result  in  this  direction  was  given  by 
Wojtasczcek  [24]  who  has  shown  that  7' i -minimization  can  be  used  with  Gaussian  random 
matrices  to  attain  instance-optimality  for  this  large  range  of  k.  This  result  was  recently 
generalized  in  [12]  to  arbitrary  random  families  in  which  the  entries  of  the  matrix  are 
generated  by  independent  draws  of  a  sub-gaussian  random  variables.  This  result  includes 
Bernoulli  matrices  whose  entries  take  the  values  ±l/^/n. 

The  problem  of  decoding  in  compressed  sensing,  as  well  as  for  more  general  inverse 
problems,  is  a  very  active  area  of  research.  In  addition  to  G -minimization  and  its  efficient 
implementation,  several  alternatives  have  been  suggested  as  being  possibly  more  efficient. 
These  include  decoding  based  on  greedy  procedures  such  as  Orthogonal  Matching  Pursuit 
(OMP)  (see  [15,  20,  21,  22])  as  well  as  decoding  through  weighted  least  squares  [11].  Some 
of  the  pertinent  issues  in  analyzing  a  decoding  method  is  the  efficiency  of  the  method 
(number  of  computations)  and  the  required  storage  needed. 

Concerning  efficiency,  Gilbert  and  Tropp  [15]  have  proposed  to  use  a  greedy  procedure, 
known  as  Orthogonal  Matching  Pursuit  (OMP)  algorithm,  in  order  to  define  A (y).  The 
greedy  algorithm  identifies  a  set  of  A  of  column  indices  which  can  be  used  to  decode  y. 
Taking  zero  as  an  initial  guess,  successive  approximations  to  y  are  formed  by  orthogo¬ 
nally  projecting  the  measurement  vector  y  onto  the  span  of  certain  incrementally  selected 
columns  (j)j  of  <f>.  In  each  step,  the  current  set  of  columns  is  expanded  by  one  further 
column  that  maximizes  the  modulus  of  the  inner  product  with  the  current  residual.  The 
following  striking  result  was  proved  in  [15]  for  a  probabilistic  setting  for  general  random 
matrices  which  include  the  Bernouli  and  Gaussian  families:  if  n  >  cklogN  with  c  suffi- 
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ciently  large,  then  for  any  k  sparse  vector  x,  the  OMP  algorithm  returns  exactly  xk  =  x 
after  k  iterations,  with  probability  greater  than  1  —  N~b  where  b  can  be  made  arbitrarily 
large  by  taking  c  large  enough. 

Decoders  like  OMP  are  of  high  interest  because  of  their  efficiency.  The  above  result  of 
Gilbert  and  Tropp  remains  as  the  only  general  statement  about  OMP  in  the  probabilistic 
setting.  A  significant  breakthrough  on  decoding  using  greedy  pursuit  was  given  in  the 
paper  of  Needcl  and  Vershynin  [20]  (see  also  their  followup  [21])  where  they  showed 
the  advantage  of  adjoining  a  batch  of  coordinates  at  each  iteration  rather  than  just  one 
coordinate  as  in  OMP.  They  show  that  such  algorithms  can  deterministically  capture 
sparse  vectors  for  a  slightly  smaller  range  than  the  large  range  of  k. 

The  present  paper  examines  decoders  based  on  thresholding  and  asks  whether  such 
algorithms  can  be  used  as  decoders  to  yield  £2  instance-optimality  in  probability  for  general 
families  of  random  matrices.  We  will  describe  in  Section  6  a  greedy  thresholding  scheme, 
referred  to  as  SThresh,  and  prove  that  it  gives  instance-optimality  in  probability  in  £2 
for  the  large  range  of  k.  This  algorithm  adds  a  batch  of  coordinates  at  each  iteration 
and  then  uses  a  thinning  procedure  to  possibly  remove  some  of  them  at  later  iterations. 
Conceptually,  one  thinks  in  terms  of  a  bucket  holding  all  of  the  coordinates  to  be  used 
in  the  construction  of  x.  In  the  analysis  of  such  algorithms  it  is  important  to  not  allow 
more  than  a  multiple  of  k  coordinates  to  gather  in  the  bucket.  The  thinning  is  used  for 
this  purpose. 

While  preparing  this  paper,  we  became  aware  of  the  work  of  Needel  and  Tropp  [22] 
in  which  they  develop  a  deterministic  algorithm  (called  COSAMP)  which  has  features 
similar  to  ours.  In  fact,  we  have  employed  some  of  the  ideas  of  that  paper  in  our  analysis. 
This  will  be  discussed  in  more  detail  after  we  give  a  precise  description  of  our  algorithm. 

While  the  benchmark  of  instance-optimality  covers  the  case  of  an  input  signal  x  which 
is  a  perturbation  of  a  sparse  signal,  it  is  not  quite  appropriate  for  dealing  with  possible 
noise  in  the  measurements.  By  this  we  mean  that  instead  of  measuring  <ha:,  our  measure¬ 
ment  vector  y  is  of  the  form 

y  =  (hr  +  e,  (1.8) 

with  e  G  lRn  a  noise  vector.  SThresh  will  also  perform  well  in  this  noisy  setting.  Stability 
under  noisy  measurements  has  been  also  established  for  COSAMP  ([22])  as  well  as  for 
schemes  based  on  £ j  -regularization  [9].  While  this  latter  strategy  requires  a-priori  knowl¬ 
edge  about  the  noise  level,  this  is  not  the  case  for  COSAMP  and  the  schemes  developed 
in  this  paper. 

A  brief  overview  of  our  paper  is  the  following.  In  the  next  section,  we  introduce  the 
probabilistic  properties  we  will  require  of  our  random  families.  In  §3,  we  introduce  a  de¬ 
terministic  algorithm  based  on  thresholding  and  analyze  its  performance.  This  algorithm 
is  then  used  as  a  basic  step  in  the  greedy  decoding  algorithm  for  stochastic  families  in  the 
following  section  §4.  In  this  section,  we  prove  that  the  stochastic  decoding  algorithm  gives 
instance  optimality  in  probability.  As  we  have  noted  above,  a  key  step  in  this  decoding  is 
a  thinning  of  the  indices  placed  into  the  bucket.  It  is  an  intriguing  question  whether  this 
thinning  is  actually  necessary.  This  leads  us  to  consider  an  algorithm  without  thinning. 
We  introduce  such  an  algorithm  in  §6  and  we  show  in  §7  that  almost  gives  instance- 
optimality  in  probability  for  i2  for  the  large  range  of  k.  The  results  for  that  algorithm 
are  weaker  than  the  thinning  algorithms  in  two  ways.  First  they  require  the  addition  of 
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a  small  term  e  to  crk(x)e2  and  secondly  the  range  of  k  is  slightly  smaller  than  the  large 
range.  Finally,  we  append  in  §8  the  proof  that  random  matrices  such  as  Gaussian  and 
Bernoulli  as  well  as  uniform  vectors  on  the  unit  sphere  satisfy  the  properties  which  are 
used  in  the  analysis  of  both  algorithms. 

While  a  lot  of  progress  has  been  made  on  understanding  the  performance  of  greedy 
algorithms  for  decoding  in  compressed  sensing,  there  remain  fundamental  unsettled  ques¬ 
tions.  The  most  prominent  is  whether  the  original  OMP  algorithm  can  indeed  give  in¬ 
stance  optimality  in  probability  for  £2  for  the  large  range  of  k. 

2  The  Setting 

As  we  have  already  mentioned,  one  of  our  goals  is  to  derive  results  that  hold  for  general 
random  families.  In  this  section,  we  state  general  properties  of  random  families  which 
will  be  used  as  assumptions  in  our  theorems. 

We  consider  random  n  x  N  matrices  <h  =  $(u;),  on  a  probability  space  (G,p).  We 
denote  the  entries  in  <h  by  (f>ij,  1  <  i  <  n,  1  <  j  <  N  and  denote  the  j'-tli  column  of  <h  by 
(f>j,  j  =  1 , ...  ,7V.  One  of  the  main  properties  needed  of  random  families  for  compressed 
sensing  is  that  given  any  x  G  IRN,  with  high  probability  <ha:  has  norm  comparable  to  that 
of  x.  We  formulate  this  in 

PI:  For  any  x  G  1Rn  and  5  >  0,  there  is  a  set  £li(x,  8)  C  Q  such  that 

lll'Ml! -]Ml!l  <  s\\x\\h'  (2.1) 

and 

<bie-^n5\  (2.2) 

where  b\  and  ci  are  absolute  constants. 

An  important  consequence  of  property  PI,  often  used  in  compressed  sensing,  is  the 
following  Restricted  Isometry  Property  (RIP),  as  formulated  by  Candes  and  Tao  [8]: 

RIP (k,  rj):  An  n  x  N  matrix  <h0  is  said  to  satisfy  the  Restricted  Isometry  Property  of 
order  m  with  constant  rj  G  (0, 1),  if 

(1  -  ri)\\x\\2  <  ||$0x||2  <  (1  +rj)\\x\\2,  x  G  Em  (2.3) 

It  was  shown  in  [3]  that  PI  implies  RIP.  More  precisely,  their  analysis  gives  the 
following  fact  (which  will  also  be  proved  in  the  Appendix  §8). 

Proposition  2.1  Whenever  the  random  family  =  <3>(u;);  u  G  Q,  of  n  x  N  matrices 
satisfies  PI,  then  for  each  77  G  (0, 1)  there  exists  a  subset  fl0(' m,rj,  $)  C  O  with 

p(n0(m,rj)c)  <  61e-£^+m[log(eiV/m)+log(12/??)]  (2.4) 

where  61,  C\  are  the  constants  from.  PI,  such  that  for  each  draw  to  G  Go('m,  v)  th e  matrix 
satisfies  RIP (771,  77)  (order  m,  with  constant  rj).  In  particular,  given  77,  if  a  is  chosen 
suitably  small  (depending  on  rj)  then  with  high  probability  will  satisfy  RIP  (777,77)  as 
long  as  m  <  an/  \og(N/m),  i.e  for  the  large  range  of  m. 
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Later  we  shall  have  to  apply  RIP  to  different  groups  of  random  matrivces  Includ¬ 
ing  <F  as  a  parameter  in  sets  of  type  O0  will  indicate  which  group  of  matrices  we  refer  to 
when  invoking  RIP. 

3  A  deterministic  thresholding  algorithm 

In  this  section,  we  shall  introduce  a  deterministic  thresholding  algorithm.  Later,  we 
shall  embed  this  algorithm  into  the  probabilistic  setting  and  show  that  the  corresponding 
probabilistic  algorithm  has  i 2  instance  optimality  in  probability. 

We  continue  to  denote  by  k  the  envisaged  range  of  instance  optimality.  We  shall 
assume  throughout  this  section  that  <F  is  an  nxN  compressed  sensing  matrix  that  satisfies 
the  RIP  (m,  rj)  where  m  >  3 k  is  an  integer  which  will  be  specified  later.  For  the  validity 
of  the  theorems  that  follow,  there  will  also  be  a  restriction  that  r]  is  sufficiently  close  to  0. 

3.1  Description  of  the  thresholding  algorithm  and  main  result 

In  this  section,  we  shall  describe  our  thresholding  algorithm.  The  algorithm  starts  with 
an  input  vector  y  G  Mn  and  generates  a  set  A  of  at  most  k  indices.  The  input  vector 
y  is  either  y  =  $2;  in  the  noiseless  case  or  y  =  Qx  +  e  in  the  presence  of  noise  e  in  the 
measurements.  The  output  of  the  algorithm  is  a  vector  x*  which  is  an  approximation  to 
x  determined  by  the  noisy  information  y. 

We  now  describe  our  thresholding  algorithm  for  decoding  an  input  vector  v  G  JRn  of 
either  type: 

DThresh[u,  k,  <5]  — ■>  x* 

(i)  Fix  a  thresholding  parameter  5  >  0.  Choose  the  sparsity  index  k,  let  r°  :=  v, 
x°  :=  0,  and  set  j  =  0,  A0  =  A0  =  0. 

(ii)  If  j  —  k  stop  and  set  x*  x3 . 

(iii)  Given  A  j  calculate  the  residual  r-7  :=  v  —  ‘Fad  for  the  input  vector  v  and  define 

A,+1  :=  {i€{l,...,/V}:|(H,«|>^Al} 

If  Aj+i  =  0,  stop  and  output  A*  =  A  j  and  x*  :=  x] . 

Otherwise  set  AJ+]  :=  Aj  U  AJ+i . 

(iv)  Compute  x(Aj+i)  (according  to  (5.13))  as 


£(Ai+i)  =  argminsupp(z)cA.+1||$z  -  v||, 

and  define  AJ+i  as  the  set  indices  v  G  AJ+i  corresponding  to  the  k  largest  (in 
absolute  value)  entries  in  x(Aj+i).  Let  x^+1  :=  x(Aj+1)a-+1,  j  +  1  j  and  return 
to  (ii). 
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Step  (iii)  is  a  thinning  step  which  prevents  the  bucket  of  indices  to  get  too  large  so 
that  in  our  analysis  RIP (77,  m)  will  turn  out  to  remain  applicable  for  a  fixed  suitable 
multiple  m  of  k. 

Perhaps  a  few  remarks  concerning  a  comparison  with  COSAMP  are  in  order.  In  both 
schemes  any  a  priori  knowledge  about  the  noise  level  is  not  needed  but  the  envisaged  spar¬ 
sity  range  k  appears  as  a  parameter  in  the  scheme.  This  is  in  contrast  to  £  1  -regularization 
in  [9]  which,  however,  does  seem  to  require  a  priori  knowledge  about  the  noise  level.  Of 
course,  one  can  take  k  as  the  largest  value  for  which  the  scheme  can  be  shown  to  perform 
well.  The  subsequent  analysis  will  show  that  this  is  indeed  the  case  for  the  maximal  range. 

While  DThresh  as  well  as  COSAMP  are  based  on  thresholding,  COSAMP  from  the 
very  beginning  always  works  with  least  squares  projections  of  size  2k.  In  the  above  scheme 
the  sets  of  active  indices  A j  are  allowed  to  grow  and,  in  fact,  the  scheme  may  terminate 
before  they  ever  reach  size  k. 

The  following  theorem  summarizes  the  convergence  properties  of  DThresh. 

Theorem  3.1  Assume  that  8,17  <  1/32  and  that  the  matrix  <f>  satisfies  RIP(m,  ?7)  with 
m  >  \k(l  +  ^§2  )1  ■  Then  for  any  x  £  1RN  andy  =  §x-\-e  the  output  x*  o/DThresh[|/,  £;,  <5] 
has  the  following  properties: 

(i)  If  in  addition  x  £  Efc,  then  the  output  x*  satisfies 

||a;  —  x*\\  <  90||e||.  (3.1) 

(ii)  If  x  £  IRn  and  xsk  is  its  best  approximation  from  E^,  i.e.  the  indices  in  Sk  identify 
the  k  largest  terms  (in  absolute  value)  in  x,  then 

\\x  -  a* ||  <  90[||$(a  -  xSk)\\  +  ||e||].  (3.2) 

(iii)  For  arbitrary  x  £  IRN ,  one  has 

||a-a*||  <  90^(1  +  h)1/2(  +<7k(x)e»')  +  llell)-  (3-3) 

We  postpone  the  proof  of  Theorem  3.1  to  Section  5  and  explain  first  its  ramifications 
in  the  stochastic  setting. 

4  Thresholding  in  the  stochastic  setting 

Let  us  now  assume  that  <3>(u;),  u  £  it,  is  a  random  family  of  matrices  which  satisfy  PI. 
As  we  have  shown  in  Proposition  2.1,  with  high  probability  on  the  draw  (see  (2.4)),  <3>(u;) 
will  satisfy  RIP  (771,77),  m  a  fixed  multiple  of  k,  for  the  large  range  of  k ,  with  constant 
a  depending  on  that  multiple  and  on  77.  We  shall  use  the  following  stochastic  version 
SThresh  of  the  thresholding  algorithm  which  differs  from  DThresh  only  in  the  initial¬ 
ization  step  (i). 


SThresh[u,  k,  h] 


x 


(i)  Fix  a  thresholding  parameter  5  >  0  and  the  sparsity  index  k.  Given  any  signal 
x  G  ]Rn  take  a  random  draw  =  $(u;)  and  consider  as  input  the  measurement 
vector  v  =  <ha:  +  e  G  Mn  where  e  is  a  noise  vector.  Let  r°  :=  v,  and  set  j  =  0, 

Ao  =  A0  =  0. 

(ii)  If  j  =  k  stop  and  set  x*  :=  xK 

(iii)  Given  A  j  calculate  the  residual  r-7  v  —  for  the  input  vector  v  and  define 

AJ+1  :=  {te{l,...,JV}:|<r^)|>^} 

If  Aj+i  =  0,  stop  and  output  A*  =  A  j  and  x*  :=  x] . 

Otherwise  set  AJ+i  :=  Aj  U  AJ+] . 

(iv)  Compute  x(Aj+1)  (according  to  (5.13))  as 

£(Aj+i)  =  argminsuppWgA,+i||$^  -  u||, 

and  define  AJ+i  as  the  set  indices  v  G  AJ+i  corresponding  to  the  k  largest  (in 
absolute  value)  entries  in  £(AJ+1).  Let  xJ+1  :=  x(Aj+1)a-+1,  j  +  1  — >  j  and  return 
to  (ii). 

Notice  that  the  output  x*  =  x*(u)  is  stochastic.  From  the  analysis  of  the  previous 
section,  we  can  deduce  the  following  theorem. 

Theorem  4.1  Assume  that  5  <  1/63  in  SThresh  and  that  the  stochastic  matrices  <3>(u;) 
have  property  PI.  Then,  for  any  x  G  MN  there  exists  a  subset  Q(x)  of  Vt  with 

p{n{x)c)  <  2 6ie-cin/8'632,  (4.1) 

such  that  for  any  u  G  Q(x)  and  measurements  of  the  form  y  =  <3>(u;):e  +  e,  with  e  G  Mn 
a  noise  vector,  the  output  x*  of  SThreshfr/,  S,  k]  satisfies 

||x  —  x*||  <  Cak{x)  +  90 1 1 e 1 1 ,  k  <  an/  \og(N/n),  (4.2) 

with  C  <  92  and  a  depending  only  on  5,  c\  and  the  bound  on  rj. 

In  particular,  when  e  =  0  this  algorithm  is  instance- optimal  in  probability  in  £2  for  the 
large  range  of  k. 

Proof:  Fixing  rj  =  1/63  and  m  =  [(1  +  f^j)k~\  we  know  by  Proposition  2.1  that  there 
exists  a  set  f20  C  O  such  that  for  eo  G  T20  the  matrix  =  <l>(a;)  satisfies  RIP(m,  1/63)  and 

p(nc0)  <  bie~ ^6^  +m i log 756+los(eAr/m)] .  (4.3) 

Thus,  as  long  as  N  >  756 m/e  it  suffices  to  have  2m\og(eN/m)  <  cyn/8  ■  632,  to  ensure 
that 

c-^n 

p(Qc0)  <  b1e~s-632  ,  whenever  k  <  an/  log (N/k)  (4.4) 
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provided  a  is  sufficiently  large.  Thus,  we  infer  from  Theorem  3.1  (ii)  that 


||z  -  ®*||  <  90(||$(a  -  xSk)\\  +  ||e||)  (4.5) 

holds  for  every  u  G  O0.  Now,  by  Property  PI,  there  exists  a  subset  Qi(xcs  ,  1/63)  with 
complement 

such  that  ||<l> (a;  —  xsfc)||  <  1.013||a;  —  xsk  ||  which  ensures  the  validity  of  (4.2)  with  fi(a;)  := 
ft0nfti(a£fc,l/63).  □ 


5  Proof  of  Theorem  3.1 

We  begin  by  collecting  a  few  prerequisites. 


5.1  Consequences  of  RIP 

Let  us  first  record  some  simple  results  that  follow  from  the  RIP  (771,  77)  assumption.  Most 
of  the  results  we  state  in  this  subsection  can  be  found  in  [20]  but  we  include  their  simple 
proofs  for  completeness  of  the  present  paper. 


Lemma  5.1  For  any  I  C  {1, . . . ,  N}  with  #(/)  <  m  we  have 

||$/||2  =  ||<M2  <  (!  +  v)- 


(5.1) 


Proof:  The  equality  in  (5.1)  holds  because  the  norm  of  a  matrix  and  its  conjugate  trans¬ 
pose  are  identical  (this  follows  for  example  from  the  fact  that  ||H||  =  sup||a.||=1;|i?/||=1  ytAx ). 
The  upper  inequality  follows  from  the  RIP  (777, 77)  assumption  because  for  any  x  G  MN , 
supported  in  /  one  has  ||$/a;||  =  ||$a;/||  <  (1  +  ?7)1/2||a;/||  =  (1  +  77)1/2 ||o;|| .  □ 


Lemma  5.2  For  any  I  with  #(/)  <  m  we  have 

(1  -  v)  <  <  (1  +  ??),  ||^/||  =  1. 


(5.2) 


and  therefore 


—  Idi ||  <  77, 


where  Idj  denotes  the  identity  matrix  of  size  #(/). 


(5.3) 


Proof:  The  upper  inequality  in  (5.2)  follows  from  Lemma  5.1  and  the  lower  inequality 
follows  from  RIP  (777, 77)  since 

||zj||  •  || $/$/:£/ 1|  >  ^$^$7X7  =  || $7-^7 1| 2  >  (1  -  V)- 

Hence  all  eigenvalues  of  $^$7  belong  to  (1  —  77, 1  +  77).  Thus  the  symmetric  matrix 
$7$7  —  Idj  has  its  eigenvalues  in  (—77,77)  which  confirms  (5.3).  □ 
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(5.4) 


Lemma  5.3  For  any  I  with  #(/)  <  m  and  any  x  with  supp(a;)  C  I ,  we  have 

%==||$/^||  <  (1  —  v) ll^ll  <  <  y/l  +  rj\\$ix\\. 

V1  +  V 

Proof:  The  upper  inequality  in  (5.4)  follows  from  Lemma  5.1.  The  two  lower  inequalities, 
follow  from  (2.3)  and  (5.2),  respectively.  □ 


Lemma  5.4  Suppose  that  T  and  J  are  sets  of  indices  such  that  ff{J  U  T)  <  m  and 
J  fl  T  =  0.  If  snpp((r)  =  T  one  has 


|$j$x||  <  T]\\x\\. 


(5.5) 


Proof:  Let  I  :=  JUT.  We  extend  the  matrices  to  size  ff{I)  x  n  by  adjoining  rows 

that  are  identically  zero  when  the  indices  are  in  I  \  J  and  I  \  T  respectively.  Similarly 
extend  Q?  so  that  it  has  columns  indexed  on  I.  Then,  since  x  is  supported  on  T  C  /,  we 
have 

-  $*t$t}x  =  [$}$/  -  Idr  -  ($*T<Z> T  ~  Idi)]x.  (5.6) 

Since  the  vectors  [<!>}$/  —  Idf\x  and  —  Id,j)\x  agree  in  all  coordinates  for  which  the 

latter  vector  is  nonzero,  we  can  take  norms  in  (5.6),  use  Lemma  5.2  and  obtain 


|<h}<f>a;||  <  ||[<h/<f)7  —  /d/]a;||  <  rj\\x\\ , 


as  desired. 


(5.7) 

□ 


As  a  summary  of  these  results,  under  the  assumption  RIP  (777,  77),  we  have  for  any  two 
disjoint  sets  A,  A'  C  {1, . . . ,  N}  such  that  #(A  U  A')  <  777,  and  for  any  vectors  u  E  1R^A\ 
v  E  Kn,  we  have 


Moreover,  for  any  A  C  {1, 


II^A'^amII  <  v\\u\\. 

(5.8) 

. ,  IV},  #A  <  777,  one  has 

ll$>ll  <  i1  +  v)1/2\H 

(5.9) 

II*a*a«||  I  (1  ±77)11^11 

(5.10) 

IK^a)-^!!  1  (1  i  77 ) — 1 1 1 77 1 1 . 

(5.11) 

We  conclude  this  section  with  some  remarks  on  solving  least  squares  problems.  Sup¬ 
pose  that  <f>  satisfies  RIP (777,77)  f°r  some  77  <  1.  Given  any  set  A  C  {1,...,1V}  with 
cardinality  <  m  and  any  input  vector  v  E  MIn,  the  least  squares  problem 


■u(A)  :=  argmin  ||u  — 

supp(z)CA 

has  a  unique  solution  given  by  the  Moore-Penrose  pseudo  inverse 

«(A)  = 


(5.12) 


(5.13) 


By  (5.10)  the  solution  can  be  computed  in  a  stable  way. 

Notice  that  =  P\v  where  P\  is  the  projector  onto  the  span  of  the  columns  f>u, 

v  E  A. 
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5.2  Analysis  of  DThresh 

In  this  section,  we  shall  analyze  the  performance  of  the  thresholding  algorithm  in  the 
deterministic  setting  and  prove  that  the  output  x*  is  a  good  approximation  to  x.  We 
fix  the  threshold  5  >  0  and  assume  that  $  satisfies  the  RIP  (m,  rj)  for  some  integer 
m  >  (1  +  2I2 )k  and  some  constant  r/  <  1.  We  shall  see  that,  if  77  and  S  are  chosen 
sufficiently  small,  the  scheme  DThresh  will  have  good  convergence  properties. 

For  our  first  lemma,  we  analyze  thresholding  when  the  input  vector  is  v  =  +  e  with 

u  G  S2fc.  Let  T  denote  the  support  of  u  so  that  by  assumption  #(T)  <  2k  and  let  A(v,  k ) 
denote  the  set  of  coordinates  v  with 

(5.14) 

Lemma  5.5  The  set  A(u,  k )  contains  at  most  coordinates. 


Proof:  Suppose  A(u,  k)  contains  a  set  /  of  <  m  coordinates.  Then  from  the  definition  of 
A (v,k)  and  (5.9),  we  have 


m^\\v\\2 

k 


<  ||$/u||2  <  {l  +  rj)\\v\\2  <  3/2||u||2 


(5.15) 


It  follows  that  #(/)  <  ^2  which  proves  the  lemma. 


□ 


The  following  lemma  will  be  key  to  our  error  analysis. 


Lemma  5.6  Assume  that  v  =  y  =  <Fa:  +  e  with  x  G  and  that  the  threshold  5  in 
DThresh[y,  5,  k]  satisfies  5  <  1/63.  Moreover,  assume  that  satisfies  RlP/m,?])  for  a 
fixed  77  <  1/63  and  m  >  (1  +  3 /52)k.  Then  for  the  iterates  xfi  j  =  0, 1, ... ,  produced  by 
DThresh[7/,  5,  k]  one  has 


\x  —  ad+1||  <  —  \\x  —  xj\ 

61 


(5.16) 


and 


\x 


XAJ+1\\  < 


3 

5 


4||e 


(5.17) 


Proof:  Let  S  be  the  support  of  x.  We  fix  j  and  use  the  abbreviated  notation  A  :=  AJ+i 
and  x  \=  x(K).  Let  T  :=  S  U  A  which  contains  the  support  of  x  —  x.  We  have 


x  —  x\\  <  (1  —  77)  1||<F^<1>t(^  —  £)|| 

<  (1-  ijr'dl'UI'M*-  x)  +e]||  +  ||4>re||} 

<  (1  -  -  i)  +  e]||  +  (1  +  i))1/2||e||}  (5.18) 


where  the  first  inequality  uses  (5.10)  (which  is  applicable  since  the  cardinality  of  T  is  <  rn 
because  of  Lemma  5.5),  and  the  third  inequality  uses  (5.1)  and  the  fact  that  the  inner 
product  of 

[A>t(x  -x)+e\=y-  <FA£  =  y-  PAy 


12 


with  any  column  of  <f>  with  index  inside  A  is  zero. 
We  estimate  the  first  term  in  (5.18)  as  follows 


ll<ts\i(®r(a:-i)  +  e]||  =  -  *a*)II 

<  ^111  +  11114^4(^-4)11 

=  11*^111 +  ll*^[*^-*i]||.  (5,19) 

To  estimate  the  first  term  on  the  right  side  of  (5.19),  we  use  the  fact  that  each  inner 
product  of  4>v,  v  G  S\ A,  with  r3  is  <  S/yfk  because  of  the  definition  of  A.  Since  #(S)  <  k, 
using  (5.1),  we  obtain 

l|$^\A[r-?]||  <  •\/<5||r7"||  =  \/5||$(x  —  XAj.)  +  e\\  <  \f8(l  +  77) 1//2 ||o:  —  x3\\  +  v^5||e||.  (5.20) 

For  the  second  term  on  the  right  side  of  (5.19),  we  note  that  A  is  disjoint  from  S  \  A 
and  that  A j  =  supp  x3  C  A,  so  we  can  invoke  (5.8)  and  obtain 

||<F^A[<Fa:J  —  <£>£]  <  r)\\xj  —  x||  <  r)[\\x  —  x3\\  +  ||a:  —  x||].  (5-21) 

If  we  use  now  the  estimate  (5.20)  and  (5.21)  in  (5.19),  we  obtain 
ll^prOr  —  x)  +  e] ||  <  Vd(l  +  rjY^Wx  —  x3\\  +  -\/<5||e||  +  rj[\\x  —  x3\\  +  ||a;  —  £||]  (5.22) 
We  now  insert  the  latter  estimate  in  (5.18)  and  obtain 
||x  — x||  <  (1  —  77)_1(77||a;  —  f||  +  ((1 +  ?7)1/2v/5  +  ?7)||a:  —  xJ||  +  [\/5  + (1 +  ?7)1^2]||e||).  (5.23) 
We  now  bring  the  term  involving  ||a;  —  £||  on  the  right  to  the  left  side  and  obtain 


(1  +  ri)1/2V6  +  77. 

11  11  -  (I-2/7)  " 

Recalling  that  x  =  x(AJ+1)  and  that  x3+x  = 
we  find 


x  —  x3  II  + 


V5  +  (1  +  r])1/2 
(1  -  2/7) 


(5.24) 


x(Aj+i)aj+1  is  its  best  fc-term  approximation, 


\x 


-x3+l\\  <  ^  x(Aj+i)||  +  p(Ai+i)  -x(Aj+i)Aj.+1||  <  2\\x  -  x(Aj+i)||,  (5.25) 


since  the  support  of  x  has  also  size  at  most  k.  Thus  we  deduce  from  (5.24)  and  (5.25) 
that 


\x  —  x3+1\\  < 


2((1  +  r])1/2V5  +  rf)  M__  ,  2(y/S  +  (1  +  vY/2) 


(1  -  2/7) 


\x  —  xJ  + 


(1  -  2/7) 


(5.26) 


When  we  invoke  our  restrictions  that  both  5  and  r]  are  <  1/63,  we  arrive  at  (5.16). 
To  derive  (5.17),  we  note  that  from  (5.16)  we  obtain 


,  i,  ,  11  ,+lll  ^  18..  ....  144,.  . 

\x  —  x\...  <  \\x  —  x3  <  — \\x  —  xJ\\  H - e 

1  A,+iii  -  11  11  -  61n  11  61  11  1 


(5.27) 
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Since  by  Lemma  5.5  #(S  U  A j)  <  k  +  ^  <  m,  we  can  apply  RIP(m,  rj)  to  conclude  that 
Ik  -  £(Aj)||  <  (1  -  y)~1/2\\$(x  ~  ®(Aj))||  <  (1  -  ri)~1/2[\\^(x  -  x(Aj))  +  e||  +  ||e||] 

=  (i-h)-1/2(lb-^A,i/ll  +  lkll) 

<  (1  -  v)~1/2[\\y  -  $XAj\\  +  ||e||]  <  (1  -  r})~1/2[\\$(x  ~  ^ Aj ) 1 1  +  2||e||] 

<  (l-r7)-1/2[(l  +  r?)1/2|k-xA.||  +  2||e||].  (5.28) 

Since  x J  is  the  best  h-terrn  approximation  of  x(Aj)  we  can  use  (5.25)  again,  to  conclude 
that 

/  ^  —1—  yo  \  1/2 

Ik  —  xj\\<2( - )  Ik  —  xa  II  +  4(1  —  h)  1/2|lell-  (5.29) 

VI  —  Tj  / 

Placing  this  in  (5.27)  and  using  the  restriction  r)  <  1/63  gives 

,,  „  3 ..  „  217 „  „ 

lk-^A,+1||  <  5 Ik -^-11  +  -g^llell, 

and  hence  (5.17).  □ 

We  can  derive  from  this  lemma  several  results  about  the  convergence  of  DThresh. 
For  this,  we  shall  use  the  following  lemma. 

Lemma  5.7  Suppose  that  x  G  1RN  and  a  <  l/s/2.  Let  A0  =  0  and  suppose  A  j  C 
{1, . . . ,  N},  j  =  1,2,... ,  jo?  are  sets  such  that 

Ik  -aW+J  <  a\\x-xAj\\,  j  =  0, . . . ,  j0  -  1.  (5.30) 

Then, 

\\x  -  xAj\\  <  (Tjix),  j  =  0, . . . ,  j0.  (5.31) 

Proof:  We  prove  this  by  induction  on  j.  This  is  obviously  true  for  j  =  0  and  we  now 
assume  this  is  true  for  any  j  <  j0  and  advance  the  induction  hypothesis.  Without  loss  of 
generality  we  can  assume  that  ki|  A  k2 1  >  •  •  ■  \%n\-  If  &j+i(x)  >  aaj(x),  then 

Ik  —  xaj+1||  <  a;|k  —  X\j  ||  <  aaj(x )  <  ctj+ i(x).  (5.32) 

On  the  other  hand,  if  aJ+1(x)  <  acrj(x),  then  a j(x)2  —  kj+i|2  <  o>2aj(x)2  or,  in  other 
words  kj+il2  >  (1  —  a2)crj(x)2.  Now,  by  our  induction  assumption, 

x2  <  a2<jj(x)2  <  (1  —  a2)crj(x)2,  (5.33) 

because  a2  <  1/2.  ft  follows  that  AJ+i  must  contain  every  %  <  j  + 1  and  so  we  again  have 
(5.31).  □ 

We  are  now  ready  to  complete  the  proof  of  Theorem  3.1, 
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Proof  of  (i):  We  want  to  show  that  ||a;  —  x*||  <  A||e||  with  A  <  280.  Let  j0  be  the 
terminating  index  of  the  algorithm.  Suppose  for  some  j\  <  j0  we  have  ||a;  —  x3'  ||  <  A||e||, 
for  some  A.  From  (5.16)  we  have  by  induction  for  any  j\  <  j  <  jo  that 


,•+1,,  18.,  144  „  „ 

\x  —  x3  <  —  a;  —  xJ\\  H - e  <  A\\e\ 

61  61 


(5.34) 


whenever  A  >  as  desired. 

For  the  next  case,  we  assume  that  the  algorithm  terminates  for  some  j  <  k ,  so  that 
A  j  =  0  and  hence  Aj  =  Aj_i  and  x*  =  x3  =  ad-1.  In  this  case,  (5.16)  gives  that 


*„  n,  18„ 

\x  —  x  =  \\x  —  xJ\\  <  —  a;  —  x 

61 


.-in  144 „  „  18 „  144 „  „ 

+  “XT  e  =  FT  x  _  x'  +  “FT  e  '  (5-35) 

61  61  61 


Thus,  1 1 x  —  x* ||  <  ^4|| e || ,  as  long  as  A  >  and  we  have  proved  this  case  as  well. 

The  last  possibility  is  that  ||a;  —  x3 \ \  >  A||e||  for  all  0  <j<k.  From  (5.29),  it  follows 
that 

1,1  -XA<n £ \CixlT^x-xi^  ~  <ivy)i73||e|1)'  (5-36) 

which,  under  the  assumption  that  ||a;  —  a;J||>A||e||,  yields 


I e  ||  < 


2(1  +  rj)lt2 

(1  -  //)V2A-4 


\x  -  X\. 


(5.37) 


This  together  with  (5.17)  yields 


(  8( 
\x-  XAj+1\\  <  ('6  +  7^ 


8(1  +  r/)1/2 


(1  —  'r])l/2A  —  4 


x  -  xAi 


0  <  j  <  k. 


(5.38) 


One  can  check  that  as  long  as  A  >  90  the  expression  in  parentheses  on  the  right  hand 
side  of  (5.38)  is  less  than  0.7  <  l/\/2-  We  are  then  allowed  to  employ  Lemma  5.7  to  find 


||a;  -  xAfc||  <  crk(x)  =  0.  (5.39) 

Using  this  in  (5.29)  gives  ||a;  —  ad||  =  ||a;  —  a;fc||  <  4(1  —  u)  1^2||ell  <  A||e||  which  concludes 
the  proof  of  (i). 


Proof  of  (ii):  For  an  arbitrary  signal  x  G  1RN ,  we  let  S  be  the  set  of  k  indices  corre¬ 
sponding  to  the  k  largest  entries  (in  absolute  value)  of  x  and  set 

y  =  &x  +  e  =  (Fes  +  +  e  =:  <Frs  +  e 

with  e  :=  e  +  <Frsc.  Applying  (i),  we  have 

||a;  -  af||  <  90  ||e||  <  90(||$3Sc||  +  ||e||)  (5.40) 


which  proves  (ii). 
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Proof  of  (iii):  Again,  let  S'  be  a  set  of  coordinates  corresponding  to  the  k  largest  entries 
in  x.  From  RIP  (m,  rj)  one  deduces  that 

\\®xs4  <  (1  +  r))1/2 +ak(x)t2y  (5.41) 

For  the  proof  of  this  see  either  [9]  or  [6] .  For  convenience  of  the  reader  we  sketch  it  here. 
Let  T0  :=  S  and  Tl+\  denote  the  set  of  indices  corresponding  to  the  next  k  largest  (in 
absolute  value)  entries  of  X(t0 u...UTi)c  so  that  ||xTi+1.||  <  k^^Wx^Wi-^-  The  last  set  Ts  may 
have  fewer  entries.  Employing  RIP  yields  then 

S  S 

||^scH  <  E  <  (i  +  y)1/2  £IMI 

i=  1  i= 1 

<  {l  +  r))1,2(ak(x)t2+^k-1/2\\xTi\\<>?)  <  (1  +  y)1/2 (ak(x)t2  +  ) • 

i=  1  ' 

If  we  use  this  in  (3.2)  we  arrive  at  (3.3).  □ 

6  A  thresholding  algorithm  without  thinning 

The  scheme  SThresh  invokes  a  thinning  step  at  each  iteration.  It  is  not  clear  whether 
this  is  necessary  for  the  successful  performance  of  this  algorithm.  This  prompts  us  to 
consider  what  can  be  proved  without  such  thinning.  In  this  section,  we  shall  introduce  and 
analyze  a  greedy  algorithm  based  only  on  thresholding  for  the  decoding  of  the  information 
y  =  <ha:  +  e.  We  shall  see  that  we  obtain  instance  optimality  in  probability  except  for  a 
small  additive  factor  that  can  be  made  as  small  as  we  wish  (as  n,  N  — »  oo). 

To  this  end,  we  shall  need  an  additional  property  of  the  family  of  random  matrices 
that  can  be  formulated  as  follows: 

P2:  For  any  z  G  Mn,  l  G  {1, . . . ,  N},  and  5  >  0,  there  is  a  set  5 ,  l )  such  that 

\(z,4>i)\  <  5\\z\\t2,  wg02(z,<5,I)  (6.1) 

and 

pmz,S,l))<b2e-a^,  (6.2) 

where  62  and  c2  are  absolute  constants. 

Throughout  this  section  we  shall  assume  that  for  each  n,  N ,  the  n  x  N  matrices  <3>(u;) 
satisfy  PI,  P2  and  that  the  entries  in  $  are  independent  identically  distributed  draws 
of  a  single  random  variable.  In  particular,  the  results  of  this  section  cover  the  random 
Bernouli  and  Gaussian  matrices. 

We  shall  assume  throughout  this  section  that  the  number  of  measurements  factors 
as  n  =  am  where  both  a  and  m  are  integers.  We  define  <Li  to  be  the  submatrix  of 
consisting  of  its  first  m  rows,  <h2  the  submatrix  of  <L  consisting  of  the  next  m  rows  and 
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so  on  up  to  <f>a.  Each  of  these  matrices  is  a  randomly  drawn  m  x  N  matrix.  We  will 
generally  denote  such  a  generic  m  x  N  randomly  drawn  matrix  as  <f>o- 

We  now  describe  a  thresholding  algorithm  for  decoding  y  =  $(#).  Given  a  set  A  C 
{1,2,...,  N}  of  column  indices,  we  denote  by  P\(y)  the  projection  of  y  onto  span{0j}jeA. 
We  also  denote  by  X(A)  the  linear  space  of  all  x  G  1RN  which  are  supported  on  A.  The 
algorithm  will  find  a  set  of  column  indices  A  =  A (y)  which  will  be  used  to  decode  y  as 
follows:  Writing  P\(y )  =  'Yhie\xf4>ii  and  denoting  by  x  G  RN  the  vector  dehned  by 
Xi  :=  xf,  i  G  A,  Xi  —  0,  i  A,  we  set 

A  (y)  =  x.  (6.3) 

To  hnd  the  set  A  used  in  the  dehnition  (6.3),  we  start  with  A0  =  A0  :=  0.  At  the  j-th 
step  of  the  algorithm,  the  algorithm  will  find  a  set  A j  of  new  coordinates.  This  is  added 
to  the  existing  “activated”  coordinates  to  give  A  j  :  =  U2=0A,:  =  U)=1A,:  as  the  current  set 
of  coordinates  in  our  approximation  to  x.  We  do  not  want  to  ever  have  more  than  2k 
coordinates  in  our  final  set  A (y).  So  we  stop  the  algorithm  as  soon  as  #Aj  >  2k.  In 
fact,  we  trim  the  last  set  of  coordinates  found  in  order  to  be  sure  the  final  set  A (y)  has 
cardinality  <  2k. 

Given  i  G  {1, . . . ,  a},  we  denote  by  y^  and  the  portion  of  y,  e,  respectively,  obtained 
by  setting  to  zero  all  coordinates  of  y,  e  whose  indices  are  not  in  {(i  —  l)m  +  1, . . . ,  im } 
while  keeping  the  remaining  coordinates  intact.  Suppose  5  G  (0, 1)  is  a  given  threshold 
tolerance.  At  present,  we  put  no  restrictions  on  <5  but  later  the  validity  of  our  theorems 
will  require  5  to  be  sufficiently  small  but  fixed. 

At  the  first  step,  we  define  r1  :=  yqj  =  $1(0;)  +  ep],  compute  ||y[i]||  and  consider  all 
coordinates  v  for  which 

\(rl,4>l)\  >  <SAT1/2||r1||.  (6.4) 

Assume  for  the  moment  that  there  are  at  most  2k  coordinates  v  satisfying  (6.4).  Then 
we  take  Ax  :=  Ai  as  the  set  of  first  activated  coordinates  and  define  A" (Ax)  and  compute 

a:1  :=  argmin  || -  y{1  j||  =  ||y[ij  -  P^Vi i]||-  (6.5) 

zex(  Ai) 

The  vector  x1  is  the  solution  to  a  least  squares  problem  and  has  a  simple  closed  form 
representation.  The  Gramian  matrix  which  needs  to  be  inverted  to  compute  x1  is  nonsin¬ 
gular  with  high  probability  because  of  the  RIP  property  given  below.  Finally,  we  dehne 
r2  ■■=  y\ 2]  -  <h2^1- 

If  there  are  more  than  2k  coordinates  satisfying  (6.4)  we  dehne  Ai  :=  Ai  as  the  set  of  2k 
coordinates  which  have  the  largest  inner  product  in  (6.4)  (with  ties  handled  arbitrarily). 
We  compute  xl  and  r2  for  this  trimmed  set  as  before.  We  stop  the  algorithm  and  output 
a*  :=  1  and  A (y)  :=  Ax  and  x  :=  x1  as  our  decoding. 

The  general  step  of  the  algorithm  is  the  following.  At  the  start  of  the  j-th  step  of  the 
algorithm,  we  have  rJ  :=  yp]  —  We  consider  the  set  of  all  coordinates  is  such  that 

|('G,0i)|  >  <$Ar1/2|H|.  (6.6) 

If  the  union  of  this  new  set  of  coordinates  together  with  A j_i  has  cardinality  <  2k,  we 
take  Aj  as  the  set  of  all  these  coordinates  and  dehne  Aj  :=  Aj_!  U  Aj  and 

x3  ■=  argmin  ||yb]  -  fyz ||  =  || y\j]  -  P^W ill-  (6.7) 

zex(Aj) 
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If  the  cardinality  of  Aj^UAj  exceeds  2k,  we  apply  trimming.  Namely,  we  define  A j  as  the 
subset  of  coordinates  from  (6.6)  with  largest  inner  products  such  that  the  resulting  set 
A  j  :=  Aj_i  U  A  j  has  cardinality  2k.  In  the  latter  case  we  stop  the  algorithm  and  output 
A  (y)  :=  A  j,  a*  :=  j,  and  x  :=  x? . 

If  the  algorithm  has  not  been  stopped  by  a  trimming  step  then  we  stop  it  when  j  =  a 
and  output  a*  =  a ,  A (y)  =  Aa  and  x  =  xa  as  our  decoding  of  y.  Here,  trimming  is  applied 
on  this  last  set  if  necessary  to  keep  A (yy)  to  have  cardinality  <  2k. 

We  summarize  the  scheme  as  follows: 

Threshll: 

(1)  Set  A0  =  A0  :=  0,  x°  :=  0,  j  =  1,  r1  =  y^y, 

(2)  Dehne  A  j  consist  of  those  u  such  that  the  inner  product  {r\4>v)  satisfy  (6.6). 

(3)  If  #(A  j_i  U  A j)  <  2k,  set  A  j  :=  A  j_i  U  A  j,  compute  x J  according  to  (6.7),  set 
rJ+1  =  y\j+i\  —  <hj+ ixi  and  j  +  1  — >  j  and  go  to  (2). 

(4)  If  #(A  j_i  U  A j)  >  2k  or  if  j  =  a,  define  A  j  by  trimming  this  set,  and  output  a*  =  j 
x  :=  x^  computed  according  to  (6.7). 

Note  that  each  of  the  quantities  appearing  above  is  stochastic  and  depends  on  the 
draw  u  G  fi,  i.e.  we  have  =  <3>j(u;),  ad  =  x\u>),  but  to  avoid  cluttering  of  notation  we 
often  suppress  this  dependence  in  notation  when  it  is  clear  from  the  context. 


7  Analysis  of  algorithm  Threshll 

The  main  result  about  Threshll  reads  as  follows. 

Theorem  7.1  Given  any  0  <  S  <  The  thresholding  decoder  applied  with  this  choice 
of  5  to  n  x  N  random  matrices,  n  =  am,  satisfying  PI  and  P2,  satisfies  the  following. 
For  any  x  €  1RN  and  any  1  <  k  <  N,  there  exists  a  set  =  fi4(x,  k)  satisfying 

p(Ql)  <  a  (V-C0™/16+3fclog(^  +  bie-cim/4  +  (2h  +  b2)Ne-cmS2/k  +  bie~Cim52^j  ,  (7.1) 

such  that  for  any  u  in  H4  and  any  noise  vector  e,  the  decoded  vector  x  of  the  above  greedy 
decoder  satisfies 


\x 


x\ 


<2-a/2\\x\\+C*(crk(x)  + 


max 

j=l,-,a* 


where  C*  :=  max{ a/408,  [1  +  3^  +  4)^]}. 


We  have  the  following  corollary  to  this  theorem. 


(7.2) 
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Corollary  7.2  Suppose  that  r,s>  0  are  given  and  that  the  random  process  generates 
m  x  N  matrices  <3>o(u;)  which  satisfy  PI  and  P2.  We  use  n  x  N  matrices  d>(u;),  to  E  fi, 
with  n  =  am  and  a  :=  \'2r  log  TV] ,  for  encoding  and  use  the  thresholding  algorithm  with 
5  E  (0,  for  decoding.  Then,  for  a  sufficiently  small  constant  c(<5)  >  0  we  have  the 

following.  For  each  x  E  MN ,  there  is  a  set  fi4  C  fi  with 

p{ni)  <  N~s  (7.3) 

such  that  for  any  draw  cu  G  fi4  and  and  any  noise  vector  e,  one  has  for  each  k  < 
c(S,  r)n/ (log  IV)2 

\\x  —  x\\  <  N~r  +  C(ak(x)  +  max  ||eui||),  (7.4) 

j= 

with  C  —  C(S)  depending  only  on  5. 

Proof:  We  apply  Theorem  7.1  with  the  values  of  a  and  5  as  specified  in  the  statement 
of  the  Corollary.  We  can  take  fi4  as  the  set  in  that  theorem.  Then  p( f^)  is  bounded  by 
(7.1).  The  second  and  fourth  terms  on  the  right  hand  side  of  (7.1)  are  both  less  or  equal 
to  C'ae~c  711  and  so  is  the  first  term  if  c(S,  r)  is  small  enough,  for  the  range  of  k  described 

_ c  log  N 

in  the  theorem.  The  remaining  third  term  is  bounded  by  aN(2b\  +  62)e_  ct5>r)  .  Thus  each 
of  these  terms  can  be  bounded  by  N~s/4  provided  c(S,  r)  is  small  enough  and  we  therefore 
obtain  (7.3).  The  estimate  (7.4)  follows  from  (7.2)  because  2_a/2  <  N~r.  □ 

The  remainder  of  this  section  is  devoted  to  the  proof  of  Theorem  7.1  and  a  short 
discussion  of  its  ramifications.  The  proof  is  somewhat  simpler  when  the  noise  e  in  the 
observation  is  zero  and  the  reader  may  wish  to  make  that  assumption  on  first  reading. 
Throughout  the  remainder  of  this  section,  for  a  given  but  fixed  x  E  MN  and  a  given  k, 
we  let  Sk  denote  a  set  of  its  largest  k  coordinates. 

In  accordance  with  the  above  initialization  we  shall  define  x°  :=  0  for  the  purposes 
of  the  analysis  that  follows  below.  We  begin  with  the  following  lemma  which  bounds 
1 1 x •—  x3\\  by  a  multiple  of  \\y^  —  &jXJ ;||.  Note  that  ad  is  stochastically  dependent  on  dq . 

Lemma  7.3  Given  x  E  1Rn  and  k  >  1,  define 

:=  fi3(x,  k)  :=  n“=1[Q0(3 k,  1/2,  %)  n  Q^xsc,  1/2,  d>j)]  (7.5) 

where  the  sets  fl0  correspond  to  RIP  and  the  sets  Hi  correspond  to  PI.  Then, 

p(nc3)  <  &0ae-com/16+3fclog<'^')  +  &iae-cim/4,  (7.6) 

and  for  each  u>  E  H3  and  1  <  j  <  a* ,  we  have  ( for  ad  =  ad  (u>)) 

\\x  -  ad||  <  (1  +  V3)ak(x)  +  y/2(\\ j/yj  -  $,ad||  +  He^-j ||).  (7.7) 

Proof:  We  first  check  the  measure  of  O3.  According  to  Properties  PI  and  RIP  (see 
(2.4))  we  have 

a  a 

p(Sij)  <  ^p(Si„(34,l/2,<I>j)“)  +  ^;p(!J1(is.,l/2,Si3V) 

3= 1  1=1 
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(7.8) 


<  6o«e_c°m/16+3felos{^;)  +  biae~Cirn/A . 


This  proves  (7.6) 

To  verify  (7.7),  we  have 

\\x  -  x3\\  <  ||a;  -  xsj  +  \\xSk  -  x3\\  =  ak(x)  +  \\xSk  -  x3\\.  (7.9) 

We  know  that  xsk  —  x 3  is  3/c-sparse  if  j  <  a*.  Hence,  for  u  G  H3,  we  have  from 

RIP(3M/2), 

\\xSk  -  x3\\  <  V2\\$jxSk  -  $jX3\\,  1  <j<a*.  (7.10) 

This  gives  for  1  <  j  <  a*, 

\\xsk  —  x3 1|  <  \/2{||$ix5fc-$ix||  +  ||$J-(x-^)||}  =  \/2{||$jX5c||  +  ||$j(x-^)||}.  (7.11) 
Since,  by  PI,  11$^  ||  <  y/sf2\\xSc ||  =  \/3/2 crk(x)  and  since 

V\j]  ~  fI)HJ  =  ®j(x  ~x3)  +  ep], 

we  have  proved  (7.7).  □ 

Onr  next  two  lemmas  are  going  to  show  the  quantitative  effects  of  thresholding  and 
will  later  be  used  to  provide  error  bounds  for  onr  algorithm. 

Lemma  7.4  Let  5  G  (0,1),  u  G  1RN ,  and  let  A  :=  A(u,5,k)  be  the  set  of  all  indices  v 
such  that  | uv\  >  5k~l/'2\\u\\.  Then, 

|| u  —  ua||2  =  ^  \uv\2  <  3<52||w||2  +  crlk(u).  (7.12) 

v<£A 

Proof:  Let  A0  be  a  set  of  the  3 k  largest  coordinates  of  u  so  that  Yh>iA0  \UA'2  =  al  k(u)- 
We  have 

^  |u„|2  +  ^  \uu\2  <  3k52\\u\\2/k  +  alk(u)  =  352\\u\\2 +  alk(u)  (7.13) 

i/^A  veA0rAc  i/eAgnAc 

where  we  used  the  fact  that  A0  has  cardinality  3k.  □ 

Lemma  7.5  Let  u  G  1RN  arid  let  v  :=  4>o(m)  +  epp  e[o]  G  Mn,  where  $0  =  $0(0;)  is  an 
m  x  N  matrix  randomly  drawn  from  our  stochastic  process  which  satisfies  PI  and  P2. 
Moreover,  assume  that 

ho]ll<IMI/4.  (7.14) 

Let  A\v,  S,  k,co)  be  the  set  of  all  v  such  that 

\{v,M\ >%||V‘/2.  (7.15) 

Then,  there  is  a  set  tt(u,  5,  k,  $0)  such  that 

cm  (5^ 

p{Llc(u,  6,  k ,  <h0))  <  (2&i  +  b2  +  l)Ne~~>^  (7.16) 
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where  c  :=  min(ci,  c2)/64  and  bi,b2,ci,c2  ore  the  constants  in  PI  and  P2,  and  for  any 
c o  G  f2(w,  5,  k ,  $o)  with  5  <  1/12,  we  /zai>e 

A(u,  25,  /c)  c  A'(u,  6,  k,  oj)  (7-17) 


and 

A'(v,  6,  k,  uj)  C  A(u,  6/2,  k),  (7-18) 

where  the  set  A (u,  5,  k )  is  defined  in  Lemma  l.f. 


Proof:  For  each  v  G  A (u,  6,  k ),  let  u(v)  u  —  uv6u  and 

v(v)  :  =  $0(«H)  +  e[0]  =  v  -  uvfiv, 

where  6V  G  1RN  is  the  z/th  coordinate  vector.  It  follows  that  ||rt(i/)||  <  ||w||  for  each  is  and 
v  =  uu(pu  +  v(is).  According  to  Property  PI,  for  each  is,  there  is  a  set  klfiufis),  5,  <F0)  such 
that 

p(fl1(u(z/),5,$0)c)<&1e-cim52  (7.19) 

and  for  all  u>  G  Oi  (u(is),  5,  $0), 


v(is)\\  <  VT+6\\u(v)\\  +  1 1  e  [0]  ||  <  (Vl  +  6  +  l/4)||tt||,  cu  G  Qi(u(is),  6,  4>0),  (7.20) 


where  we  have  used  the  assumption  that  ||e[0]||  <  ||«||/4.  Observe  that  v(is)  is  stochasti¬ 
cally  independent  of  Therefore,  according  to  Property  P2,  there  is  a  set  H2(v(is),  6/(8 Vk),  is,  $0) 
with 

p(tt2(v(is),  6/ (8 Vk),  is;  <F0)C)  <  b2e  eik~  (7.21) 

and,  such  that  for  all  us  G  fl2(v(is),  6/(8y/k),  is,  <P0),  we  have 


\{v(is),(j)u)\  < 

< 


5||u(z/)H  6(V  1  +  6  +  1/4)||  u(is) 


< 


8 \fk 
5y/T+5\\u\ 


i6\fk 


8 \fk 

LU  G  Vt2(u(is),  6,  is;  <P0) 


where  we  have  used  (7.20)  and  the  fact  that  ||-u(z/)||  <  || u 
We  now  dehne 


(7.22) 


Q(u,  5,  k,  $0)  := 

P|  (Lli(u(is),  6,  <F0)  O  fli(^,  6,  $0)  O  Vt2(v(is),  6/(8\fk),  is,  $0)  j  0  Llfiu,  6,  <h0), 

pGA  (u,8,k) 

Then,  this  set  satisfies  (7.16)  because  of  (7.19),  (7.21)  and  property  PI  applied  to  6U  and 
u. 

We  now  prove  (7.17).  For  any  u>  G  Cl(u,  6,  k,  <h0)  and  any  is  G  A (u,  26,  k ),  by  (7.22), 

\(v,fiv)\  >  \uv\(l  -  5)  -  \{v(v),<j>v)\ 

>  2/c_1/2(5(1  —  <5)||u||  —  /c_1/2(5/6)v/r+^||u||.  (7.23) 
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Invoking  PI  with  respect  to  u,  we  conclude  that 


a/T+1)\  ||<MI 


vM\  >  -^=(2(1  —  S)  -  ) 


6  7^1 Th 


Now  observe,  again  by  PI,  that  (7.14)  implies 


(7.24) 


lle[o]ll  < 


4  vT+7 


ll$o«||  < 


4V1  +  <5 


(IMI  +  IMI), 


whence  one  infers 


n  ,,  1  „  „  1„ 

e  roi  <  - .  - m  <  -  m 

11  1 JN  “  4(vT+7-  1/4)  3"  1 


Thus,  combining  (7.24)  and  (7.25),  gives 

2  2^  U 


6  )}vt|,!| 


(7.25) 


(7.26) 


One  can  verify  that  for  <5  <  1/10  the  factor  in  curly  brackets  is  indeed  larger  than  1  which 
shows  (7.17). 

We  now  prove  (7.18).  For  any  u  G  Q(u,  8,  k,  <Lo)  and  any  v  A(u,8/2,k),  we  use 
again  (7.22)  and  PI  for  u  to  conclude  that 


I(v,0„)j  <  \ul/\(l  +  5)  +  \(v(v),<j)u)\ 

<  (1  +  8)(8/2)k~1/2\\u\\  +  k~1/2{5/6)Vl  +  S\\u\ 

-  y/k\  2  6  J  " 

.  8  / 1  +  h  \/l  +  8\  1  . . . 

-  +  ~r~ )tt^i(I11’11  +  l|e|o|ll) 

^  8  /1  +  8  \/l  +  8 


6  7  3  Vl^8 


(7.27) 


where  we  have  used  (7.25)  in  the  last  step.  One  can  verify  that  for  <5  <  1/12  one  has 
j  <  1  so  that  \{v,<t>v)  \  <  <5/c_172||ri||.  This  shows  that  any  such  v  could 

not  have  been  chosen  for  A'(v,  8,  k,u).  This  completes  the  proof  of  the  lemma.  □ 


We  can  now  define  the  set  O4  that  appears  in  the  statement  of  Theorem  7.1  and 
Corollary  7.2  as 

O4  :=  Q4(x,  k,  h)  :=  Q3(x,  k)  0  0“=1[fi(x  —  x^1, 5,  k,  $,■)  0  ^(xgc,  8,  $,-)],  (7.28) 

where  O3  is  the  set  in  Lemma  7.3,  the  next  sets  in  brackets  come  from  Lemma  7.5  and 
the  last  sets  come  from  PI.  Let  us  note  that 

p(n4(x,k,8)c)  <  a  {(60e-com/16+3fclog(^) +  &ie-c im/4 
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+  (2  b1  +b2  +  l)Ne-cmp/k  +  bie~cim52},  (7.29) 

Indeed,  this  estimate  follows  from  (7.6),  (7.16)  and  PI  and  the  fact  that  a*  <  a.  The  set 
04  will  be  used  in  the  remainder  of  this  section. 

The  next  lemma  shows  a  certain  reduction  of  the  error  at  each  iteration  of  the  algo¬ 
rithm. 

Lemma  7.6  For  each  1/12  >  <5  >  0,  each  x  £  1RN ,  each  u  £  Q4 (x,k,6)  and  each 
1  <  j  <  a*,  we  have 


\\x  —  xj \\2  <  A\\x  —  xj  1\\2  +  Bal(x)  +  C\\e[j]\\2,  (7.30) 

where  A  :=  96<52,  B  :=  204,  and  C  =  196.  This  same  estimate  holds  for  j  =  a*  provided 
this  last  set  was  not  trimmed. 

Proof:  We  fix  a  value  of  j  and  assume  that  u  £  Q4(x,  k,  h).  At  the  beginning  of  the  j'-th 
step  of  the  decoding  we  have  in  hand  Aj_i  and  ad-1  where,  according  to  our  initialization 
of  the  algorithm,  a:0  :=  0  and  Ao  :=  0.  Thresholding  on  the  vector  $*r7  =  $*(<!/,■  (a;  — 
aW1)  +  epj),  now  gives  the  set  Aj  and  the  new  composite  set  A  j.  By  our  assumption  on 
j,  there  was  no  trimming  involved. 

We  shall  distinguish  between  two  cases:  (a)  ||ep]||  <  \\x  —  x^l\\/ A  and  (b)  ||e[j]||  > 
\\x  —  a4_1||/4. 

In  case  (a),  since  is  drawn  independently  of  x  —  aW1,  we  can  apply  Lemma  7.5 
for  u  :=  x  —  xJ~l  and  e[0]  :=  epj.  It  says  that  for  u  £  f2(a;  —  ad-1,  S,  k,  <hj),  the  set  A  j 
contains  all  coordinates  v  for  which  \xv  —  xf~1\  >  25k~1^2\\x  —  x^l\\.  Hence,  we  can  apply 
Lemma  7.4  to  u  =  x  —  xJ~]  and  obtain  for  w  :=  x 7-1  +  (x  —  xJ~l)t\  ,  upon  noting  that 
(x  —  x^~1)ac  =  x  —  w, 

||x  —  w\\2  <  1252\\x  —  x^l\\2  +  &lk(x  —  a:7-1)  <  12<52||a;  —  a;-7- 1 1| 2  +  a2,(a;),  (7.31) 

where  the  last  inequality  uses  the  fact  that  cr3k(x  —  ad-1)  <  ak{x )  because  a:7-1  is  in  H2fc. 
Starting  with  Lemma  7.3,  we  can  now  estimate 

\\x-xJ\\  <  (1  +  Vs)ak(x)  +  V2(\\y^  -  <&jX3 1|  +  \\e^  ||) 

<  (1  +  V3)ak(x)  +  V2(\\ym  -  <L,w|l  +  \\ej\\) 

<  (1  +  V3)ak(x)  +  V2(\\^(x  -  W)||  +  2||ep]||)  (7.32) 

where  the  second  to  last  inequality  uses  the  minimality  of  the  least  squares  solution  in 
the  space  X(Aj)  of  vectors  in  1RN  supported  in  Aj. 

We  now  want  to  estimate  the  middle  term  in  (7.32).  We  cannot  use  PI  directly 
because  w  depends  on  Instead,  we  write  x  —  w  =  x  —  xsk  +  ( xsk  —  w)  and  find 

Il$i0c-u0||  <  ||$j(x-xgfc)||  +  \\^j(xSk  -  w)|| 

<  Vl  +  5  (||xs=||  +  ||a;Sfc  -  w\\) 

<  Vl  +  5  (gk(x)  +  ||x  -  Xgfc||  +  ||x  -  w||) 

<  V 1  +  5  ( 2ak(x )  +  \\x  -  w\\ ) .  (7.33) 
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Here  in  the  second  inequality  we  used  PI  for  xsck  and  RIP(3/c,<5)  for  <f>j. 

We  now  subsitute  (7.33)  into  (7.32)  to  obtain 

||x  — ad||  <  (1  +  V3  +  2V2\/l+~d)ak(x)  +  V2y/Y+~5\\x  -  w||  +  2  V^2  He^j  || 

<  7ak{x)  +  2\\x-w\\+2V2\\elj]l  (7.34) 

because  0  <  5  <  1.  We  square  this  last  inequality  and  then  use  (7.31)  to  arrive  at 

||a;  —  ad  || 2  <  2(7o'l(x)  +  2v/2||e[y] ||)2  +  8||a;  -  w\\2 

<  204u^(a;)  +  96<52||a;  —  1 1| 2  +  32||e[j]||2 

<  A\\x- x^f  +  Ba2k(x) +  C\\e{j]\\2,  (7.35) 

as  desired. 

Now  we  turn  to  case  (b)  1 1 e [j] 1 1  >  ||a;  —  od_1||/4,  and  use  again  that  is  drawn 
independently  of  x  —  ad-1,  so  that  PI  yields 

Ileb1  II  —  4^/l+S  1 1  (X  ~  XJ  X)H 
=  i7f+|ll^b1  —  ®jxj  1  ~  eb']ll 
—  STT+J^l^b]  —  II  —  lleb']ID- 

Since  ad  minimizes  \\y\jj  —  over  A"(Aj),  and  A j  contains  the  support  of  ad-1,  we 
conclude  that 

(4^TTA  +  l)||eb]||  >  \\ijj]  -  QjX3-1  II  >  11^]  -  QjX3  II .  (7.36) 

Note  that  (4\/l  +  5  +  1)  <  6  for  5  <  1/10.  We  invoke  now  Lemma  7.3  to  conclude  that 

||o;-ad||  <  (1  +  Vs)ak(x)  +  7V^||e[j]||. 

Squaring  both  sides  confirms  (7.30)  and  finishes  the  proof.  □ 

Proof  Theorem  7.1:  We  can  now  prove  our  main  result  about  the  greedy  algorithm  of 
this  section.  Let  6  <  l/8\/3-  Then  A  =  96<52  <  1/2  and  <5  <  1/12. 

We  will  continue  to  denote  by  Sk  a  set  of  k  coordinates  corresponding  to  the  largest 
entries  (in  absolute  value)  of  x.  We  shall  consider  two  cases. 

Case  1:  In  this  first  case,  we  assume  that  the  algorithm  never  employed  trimming.  In 
particular  this  means  that  a*  =  a  and  x  =  xa.  We  introduce  the  abbreviated  notation 
Ej  :=  ||a;  —  ad||2,  rjj  :=  1 1 e [j] 1 1 2 ,  and  o  :=  o\(x).  Then,  an  application  of  Lemma  7.6  at 
each  iteration  gives 

Ej  <  AEj_i  +  Bo  +  Cr]j  j  =  2, . . . ,  a* .  (7.37) 

Iteratively  applying  this  inequality  gives 

7?  a— 1 

llx  -  x\\2  =  \\x  -  xa\\2  <  Aa\\x\\2  +  - — -jol(x)  +  Cj2AiVj-i 

i= 0 

a—  1 

<  2-a||o:||2  +  2Bo2k(x)  +  C[VA]  max  m  (7.38) 

'  7=1,—, a* 

i= 0 
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Since  A  <  1/2,  this  proves  (7.2)  in  this  case. 

Case  2:  The  remaining  case  we  have  to  consider  is  when  trimming  was  used  to  create 
the  last  set  Aa*.  In  this  case  the  following  observation  is  useful. 


Remark  7.7  For  any  v  G  Aa*  one  has  for  any  w  6  h4 

S  .  <5 

— 7=  max  epi  +  \xv\  >  — =\ 

2\[k  i=1. ■■■>“*  m  2 V2k 


x  ~  xAn *  I 


(7.39) 


To  prove  (7.39),  for  any  v  G  Aa*  consider  the  hrst  iteration  j  when  v  G  Ar  The  inner 
product  of  r3  with  was  by  dehnition  larger  than  <5/c“1/2||r7||.  It  follows  from  (7.18)  that 
for  to  G  fhi(:r,  k,  h) 

1  11  (7.40) 


\Xv\  > 


2  Vk 


Since  x 3  1  is  supported  on  A^  j ,  one  obtains 

xAj^i  II  <  ||x  —  xJ  l  II  <  (l^<J)_1/2||$i(a:-a^“1)||  <  V2(\\r3\\  +  ||eb1||)  (7.41) 


\x 


where  the  last  inequality  uses  that  w  6  fl4  allows  us  to  apply  PI  for  x  —  x 3  1  as  well  as 
the  fact  that  <5  <  1/2.  This  confirms  (7.39).  □ 

Since  trimming  was  used  we  have  #(Aa.)  =  2k.  It  follows  that  Aa*  contains  at  least  k 
coordinates  from  S jb  Since  (7.39)  holds  for  each  v  G  Aa*  n  S%,  it  follows  that 


v£Aa,nS? 


r- 


On  the  other  hand,  we  have 


E 


t/eA^n  Sck 


(M  +  maxJ=i ||eb]||)  <  2  EiyeAa*nS^  \x- 


,  0 62#(Aa*nS?) 

+2 - 77 — ^  maxi=i. 


Ileb1  IP 

<  2 ak(x)2  +  52maxj=h...,a*  ||eb]||2. 


Thus  we  conclude  that 


\x-x~aJ<U(x)  +  2V2  _maxj|eb]||, 

0  j — I,”*, a 


(7.42) 


where  we  have  used  the  fact  that  #(Aa*  0  S%)  >  k. 

We  now  turn  to  estimating  ||a;  —  x||.  We  begin  with 


\x  —  x\\  —  \\x  —  x 


<  +  \\xSk-xa*\\  <  <Tk(x)  +  \\xSk-X° 


(7.43) 


The  second  term  was  estimated  in  (7.11)  in  the  proof  of  Lemma  7.3  (with  j  =  a*). 
Using  that  estimate  and  the  minimality  of  the  least  squares  solution,  we  obtain 


\x 


—  x||  <  (1  +  V3)ak(x)  +  V2\\$a*(x  -  xa 
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<  (1  +  V3)<Jk(x)  +  V2(||y[a.j  -  $a*a;a*||  +  ||e[a»]||) 

<  (1  +  V3)ak(x)  +  V^dl^j  -  <$>a*x Aa*  II  +  max  ^  ||eb]||) 

J  —  ia 

<  (1  +  Vs)ak(x)  +  \/2(||$o*(^-^5fc)||  +  ||$a*(zsfc  -Xa0.)||  +2  max  ||eb-]||.)- 

Now  U6O4C  ^3.  Looking  at  the  definition  of  ^3  in  (7.5),  we  see  that  we  can  apply  PI 
and  RIP  to  conclude  that 

\\x  -  ^||  <  (1  +  V3)ak(x)  +  a/3(||x5c||  +  ||a;sfc  -  x~K,  ||)  +  2vA>  max  ^  ||eb]|| 

j — I,---, a* 

<  (1  +  2Vs)ak(x)  +  \/3(||x  -  xs fe||  +  ||a;  -  x^  J|)  +  2^2  max  ||eui||. 

“  j=l,—,a* 

<  (1  +  3\Z3)ak(x)  +  V3\\x  -  xao,  ||  +  2V2  .max*  ||eb]|| 

<  [1  +  3V3  +  ^^-]crk(x)  +  2(v/2  +  \/6)  jnax  J|eb]||,  (7.44) 

where  the  last  inequality  uses  (7.42).  This  shows  that  (7.2)  holds  in  the  second  case  as 

well  and  completes  the  proof  of  the  theorem.  □ 

We  conclude  this  section  with  some  remarks.  The  above  argument  shows  that  as  soon 
as  trimming  is  necessary,  i.e.  the  sets  A  j  build  up  fast  enough,  the  decoder  is  actually 
instance-optimal  in  the  original  sense  when  e  =  0,  see  (7.44). 

Remark  7.8  Even  when  trimming  does  not  occur  in  the  algorithm,  we  still  have  the 
following  estimate:  suppose  that  e  =  0  and  2 q  :=  ffAa  <  2k,  then  one  has 

(7.45) 

Thus,  as  long  as  the  size  of  Aa*,  i.e.  the  support  of  the  decoder  output  is  comparable  to  k, 
one  can  bound  the  error  by  a  constant  multiple  of  the  corresponding  q-term  approximation 
error. 

Proof:  Remark  7.7  together  with  the  argument  leading  to  (7.42)  yields 

/  n2  ^  <5V|  2 

0~k(x)  >  — —  X  —  X\ 

y  ’  -  8k  11  A“" 

Inserting  this  in  the  argument  leading  to  (7.44),  confirms  the  claim.  □ 


\x  -  x||  <  (5  +  V3)a, k(x)  +  5  M 


An  extreme  case  where  the  algorithm  would  not  perform  well  is  Xi  =  N  17,3 ,  i  = 
. . ,  N,  in  which  case  Ck(x)  =  \ which  stays  close  to  one  in  the  range  of  k  under 


consideration.  Since  the  sets  A'(rJ  ,  S,  k,  uf)  are  for  most  u>  contained  in  the  sets  A(x  — 
x:’~1,5/2,k),  but  ||a;  —  2d_1||  >  a^ix),  no  entry  would  actually  satisfy  \xf\  =  iW1/2  > 
^A^||x  —  ad-1 1|  so  that  the  sets  A  j  would  not  build  up.  On  the  other  hand,  in  such  a  case 
it  would  be  irrelevant  which  entries  to  pick. 
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8  Appendix:  random  matrices  satisfying  PI  and  P2 

In  this  section,  we  prove  the  validity  of  PI  and  P2  for  three  standard  examples  of  random 
matrices: 

(i)  Gaussian:  the  entries  <3>y  are  i.i.d.  centered  Gaussian  variables  of  variance  1  jn. 

(ii)  Bernoulli:  the  entries  are  i.i.d.  Bernoulli  variables  with  values  ±1  j\fn. 

(iii)  Unit  vectors:  the  columns  are  i.i.d.  under  the  uniform  law  on  the  n-dimensional 
sphere. 

We  also  show  that  PI  implies  RIP. 

8.1  Proof  of  PI 

By  linearity  it  is  sufficient  to  consider  a  vector  x  of  norm  1  and  therefore  evaluate 
Prob{|||<I>a:||2  —  1|  >  h}. 

The  validity  of  PI  for  Gaussian  and  Bernoulli  matrices  is  a  consequence  of  Lemma 

6.1  of  [12],  which  establishes  this  property  for  a  more  general  class  of  random  matrices 
with  i.i.d.  entries  that  have  a  subgaussian  distribution. 

For  matrices  consisting  of  random  unit  vectors,  we  notice  that  the  function 

N 

M{<j>u  ■■■An)  =  W^XjtjW  =  IMI 

3= 1 

is  a  Lip  1  function  from  (IR11)N  to  1R  since 

N  N 

|M(0!,  •  •  -An)  I  <  II  ~  4>'j)\\  <  ' 

3= 1  i=1 

The  uniform  product  measure  on  an  TV-fold  tensor  product  of  Sn_ \  by  itself  has  the  same 
concentration  function  of  the  form  e_^n_1^2//2  as  »5'n_ \  (see  [18])  and  therefore,  using  the 
notation  X  =  ||<F:r|| 

ProbjjX  -fi\>6^<  2e-{n~1)s2/2, 

where  //  is  the  median  of  X.  It  is  easy  to  derive  from  this  a  similar  estimation  where 
the  median  /i  is  replaced  by  the  average  E(X):  assume  that  X  is  a  random  variable  that 
satisfies 

Prob(|X  —  fx\  >5)  <ae~bs2, 

with  a  >  1.  Then,  assuming  without  loss  of  generality  that  E(X)  >  /i,  we  obtain 

+oo 

E(X)-»<  f  ae-l‘>  = 'if- . 

0 
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Since 


It  follows  that  Prob(|A"  —  E(X)\  >  5)  <  Prob(| A  —  n\  >  |)  when  <  §. 

Prob(|A"  —  E(X)\  >  5)  <  1,  we  have  for  any  5  >  0, 

ProbQX  -  E(X) |  >  5)  <  ae~bs2/4,  a  =  e™2/ 4, 


where  we  have  used  that  a  >  1.  In  the  present  case  A"  =  ||<ha;||,  this  gives 
Prob(jX  -  E(X) |  >  h)  <  e7re_(n_1)52/4  <  27e~{n~1)52/4 
By  integration,  we  obtain 


E(\X  -  E(X)\2)  <  54 


+oo 

j  te-{n~1)t2/4dt 
o 


108 
n  —  1 


Since  £i(||$x||2)  =  ||a;||2  =  1,  we  thus  have 


0  <  1  -  E(X)2 


E(X2)  -  E(X)2 


E{\X-E{X)\2)  < 


108 
n  —  1’ 


which  implies 


0  <  1  -  E{X)  < 


108 
n  —  1 


It  follows  from  (8.1)  and  (8.2)  that  when  5  > 


Prob(jX  -  1|  >  <$)  <  ProbjjX  -  E(X)\  >  5/ 2)  <  27e~{n~1)52/w. 
On  the  other  hand,  if  <5  <  Alb  we  have 

’  —  n—1  ’ 


Prob(|X  -  1|  >  <*)  <  1  <  e^e-(n-1)52/w. 
Therefore,  in  all  cases  we  have 

Prob(|X  -  1|  >  <  C0e-(n~1)<52/16, 


with  C0  =  nrax{27,  e™-1 }.  We  conclude  by 

Prob(|  ||4>x||2  -  1|  >  5)  =  Prob(|A2  -  1|  >  6) 

<  Prob(|A  -  1|  >  5/3)  +  Prob(|A  +  1|  >  3) 

<  Prob(|A  -  lj  >  5/3)  +  Prob(|X  -  1|  >  1) 

<  2C'0e“(n_1)'s2/144, 


(8.1) 


(8.2) 


for  all  0  <  5  <  1,  which  shows  that  PI  holds. 
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8.2  Proof  of  P2 


Again,  by  linearity,  it  is  sufficient  to  consider  a  vector  z  of  norm  1  and  evaluate  Prob{|  (z,  (pi)  \  > 

5}. 

For  Gaussian  matrices,  we  note  that  (z,  4>i)  =  ]G/=]  z^^i  is  a  centered  Gaussian  vari¬ 
able  with  variance  1/n.  From  this  it  follows  that 

Prob{|(z,  (j)i)  |  >  8}  =  -j=  I  e~t2/ 2  dt  <  ~^= e~nS2/ 2  J  dt  =  e~nS2/ 2 

t>y/nS  t>y/nS 

(8.3) 

Therefore  P2  holds  with  b2  =  1  and  c2  =  1/2. 

For  Bernoulli  matrices,  we  invoke  Hoeffding’s  inequality  [16],  which  states  that  for  a 
sequence  of  independent  variables  vt  with  mean  zero  and  such  that  w/  <  m,  almost 
surely,  one  has 

<52 

Prob 1 1  Vj\  >  8 1  <  2e  2Em< .  (8.4) 

It  follows  that 

_  s2 

Prob{|(^,  4>i)\  >d}<2e  2E^  =  2e~n52/2.  (8.5) 

Therefore  P2  holds  with  b2  =  2  and  c2  =  1/2. 

For  matrices  consisting  of  random  unit  vectors,  Prob{|(z,  </>/)(  >  5}  is  the  ratio  between 
the  measure  of  the  set  S^n- 1  :=  {x  G  Sn- 1,  \  {z,x)\  >  5}  and  the  measure  of  the  whole 
sphere  Sn-\.  It  is  well  known  that  this  ratio  tends  exponentially  to  Oasn->  +00.  More 
precisely,  using  that  the  uniform  measure  on  An_  1  has  a  concentration  function  of  the 
form  e-G-i)<52/2  ("see  [17] );  one  obtains  that 

Prob{ | </>z)  |  >  <5}  <  2e_^n_1)<52^2.  (8.6) 

Since,  the  inner  product  is  clearly  <  1,  we  only  need  to  consider  8  <  1,  in  which  case  we 
find  that  P2  holds  with  b2  =  2  and  c2  =  1/4  provided  n  >  2. 

8.3  PI  implies  RIP 

The  restricted  isometry  property  RIP.  introduced  by  Candes,  Tao  and  Romberg  [8,  9] 
states  that  acts  close  to  an  isometry  on  all  m  sparse  vector.  In  other  words,  for  some 
0  <  rj  <  1, 

(l-r7)|M<||^r||  <(1  +  17)11^11,  x  e  RN ,  \T\  <  k.  (8.7) 

Following  the  approach  in  [3],  we  shall  prove  that  PI  implies  that  (2.3)  holds  with  prob¬ 
ability  larger  than  1  —  51e_nJ++m[logG7V/m)+logP2/''?)]. 

We  first  fix  T  such  that  |T|  <  N  and  consider  the  set  XT  of  TV-dimensional  vectors 
with  support  contained  in  T.  We  shall  prove  that  PI  implies  the  validity  of 

(1  -  77)||x||  <  ||$x||  <  (1 +77)||x|[t  x  e  Xt,  (8.8) 
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with  probability  larger  than  1  —  61(12/?7)me_Cimr/4.  Since  the  number  of  sets  T  of  car¬ 
dinality  m  is  (^)  <  (eN/m)m,  it  will  follow  from  the  union  bound  that  (2.3)  holds  with 
probability  larger  than  1  —  bi(eN/m)m(12/r])rne~cimr^4:,  which  is  the  announced  result. 

We  choose  a  finite  set  Q  C  Xt  such  that  ||g||  =  1  for  all  q  G  Q  and  such  that  for  all 
x  G  Xt  with  ||a;||  =  1,  there  exists  q  G  Q  with  ||a;  —  g||  <  q/4.  ft  is  well  known  from 
covering  numbers  that  one  can  choose  such  a  set  Q  with  \Q\  <  (12/rj)m  (see  e.g.  Chapter 
13  of  [19]). 

Invoking  PI  and  a  union  bound,  we  thus  obtain  that  with  probability  larger  than 
1  —  &i(12/?/)me-cinT?2/4,  we  have  for  all  q  G  Q. 

(i-v/2)\\qr<\m2<(i+v/2)\\qr,  (8.9) 

which  trivially  gives 

(l-ri/2)\\q\\  <  ||$g||  <  [l  +  g  /  2)\\q\\, 

Denoting  by  M  the  norm  of  restricted  to  XT,  we  derive  from  the 
the  covering  property  of  Q, 

M  =  sups6XT)||a.||<1  \\$x\\ 

<  supa.6XTiW<iiiifgeQ(||$g||  +  ||$(a:  -  q) ||) 

<  1  +  r)/2  +  M supa.eXTi||a.||<1  ||a;  -  g|| 

<  1  +  77/2  +  Mrj/4. 

It  follows  that  M  <  (1  +  rj/2) /(I  —  77/ 4)  <  1  +  r]  which  gives  the  upper  inequality  in  (8.8). 
The  lower  inequality  follows  from  it  since  for  all  x  G  XT  with  ||a;||  =  1  and  q  G  Q  such 
that  ||a;  —  g||  <  77/4,  we  have 

||$x||  >  ||<hg||  —  ||$(x  —  g)||  >  1  —  77/2  —  (1  +  77)77/4  >  1  —  77.  (8.11) 

By  linearity,  the  lower  inequality  is  thus  proved  for  any  x  G  Xt- 
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upper  inequality  and 
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